I. INTRODUCTION

Truly complex stochastic processes, the infinitary processes [1]
whose mutual information between past and future diverges, arise
in many physical and biological systems [2–4], such as those in
critical states. They are implicated in many natural phenomena,
from the geophysics of earthquakes [5] and physiological
measurements of neural avalanches [6, 7] to semantics in natural
language [5] and cascading failure in power transmission grids [9].
Their apparent infinite memory makes empirical estimation and
modeling particularly challenging. The difficulty is reflected in
the computational complexity of inference [10]: the resources
required to predict and model them diverge in sample size, in
memory for storing model parameters, and in memory required for
prediction. Resource scaling, an analog of the venerable technique
of finite-size scaling in statistical mechanics, suggests that for
infinitary processes we look for statistical signatures that track
divergences. Since resource divergences are sensitive to a
process's inherent randomness and organization, one hopes that
their scaling forms are uniquely revealing indicators of process
complexity and can guide the selection of appropriate models.

To date, though, there are few tractable constructions with which
to explore possible general relationships between prediction,
complexity, and learning for infinitary processes. One of the few
tractable and general constructions is the class of Bandit
processes, constructed from repeated trials of an experiment whose
properties are, themselves, varying stochastically from trial to
trial mm-. Even if each individual trial is a realization generated
by a stationary process with finite memory and exponentially
decaying correlations, the resulting process over many trials can
be infinitary EHS3-.

Why can the past-future mutual information of Bandit processes
diverge? The answer is remarkably simple: Bandit processes are
nonergodic. More to the point, the divergence is driven by memory
in the nonergodic part of their construction, namely the mechanism
in each trial that selects and then remembers the operant ergodic
component. Here, we use that insight to provide a simple,
alternative derivation of information divergence for this class of
infinitary process: a structural complexity scaling that directly
accounts for nonergodicity.

Information divergence in Bandit processes has been interpreted as
reflecting a universal property of learning: a unique indicator of
the number of process parameters [3]. The derivation presented here
recovers the connection between the complexity of parameter
estimation and divergence in past-future information. However, it
also identifies other structural features, such as infinitary
ergodic components, that can drive divergences. Thus, information
divergences in Bandit processes reflect particular structural
properties of this class, rather than overarching principles of
prediction, complexity, and learning for infinitary processes.
Nonetheless, the issues raised highlight the need for a more
balanced view of truly complex processes and their challenges. We
hope our simplified analysis introduces tools appropriate to
further, detailed scaling analysis of both ergodic and nonergodic
infinitary processes.

Analyzing structural complexity is often conflated with statistical
and computation-theoretic approaches to complex processes. To
ameliorate this, the next section reviews these alternatives. Then
we move on to construct Bandit processes and analyze their
structural complexity. We then discuss the results, draw out
contrasts with computation-theoretic and statistical approaches,
highlight the structural hierarchy of ergodic processes, and close
with a brief discussion of hierarchical processes with nested
organization.


II. PREDICTION, COMPLEXITY, AND LEARNING

There is a relationship between, on the one hand, the inherent
unpredictability and memory in a process and, on the other, the
difficulty of learning a model from time series samples and
predicting the time series. Alternative framings lead to different
views of this relationship. There are those that attempt to exactly
describe a time series, those that try to express persistent
regularities, and those that consider the consequences for
inference. Their methods are closely related.

The Kolmogorov-Chaitin complexity monitors the computational
resources required to reconstruct an individual time series
[13–18]; specifically, the length of the minimal program for a
given Universal Turing Machine (UTM). It is a measure of
randomness: A random time series has no smaller description than
itself. Elaborating on this, logical depth [19] and sophistication
[20] track complementary computational resources. Logical depth is
the number of compute steps the minimal UTM program requires to
generate the time series. Sophistication is the length of that part
of the UTM program which captures regularities and organization,
effectively discounting the time series' irreducible randomness.
All these are uncomputable, though, even if one is given a
generative model.

Fortunately, for a process's typical realizations the
Kolmogorov-Chaitin complexity grows linearly with time series
length, with coefficient equal to the Shannon source entropy rate
$h_\mu$ (a measure of a process's unpredictability) and offset
equal to the statistical complexity $C_\mu$ (a measure of a
process's memory) [21, and references therein]. Given a generative
model called the $\epsilon$-machine, both the entropy rate and
statistical complexity are computable; if the $\epsilon$-machine is
finite, they are calculable in closed form [22].

We say that $h_\mu$, $C_\mu$, and the finite-time excess entropy
discussed later are intrinsic measures of a process's structure,
randomness, and organization. By intrinsic, we mean that these
measures exist independently of the amount of data that we have
observed. The aforementioned algorithmic complexities explicitly
depend on the amount of data seen so far, but if the process is
ergodic, then algorithmic complexities are also (almost always)
intrinsic to a process in the limit of an arbitrarily large amount
of data.

Such analyses of intrinsic properties should be contrasted with how
statistical inference approaches complex processes. Statistical
learning theory [23, 24] analyses and machine learning complexity
controls [25–28] are not intrinsic in the sense that they show how
to choose the best in-class model, but the choice of that class
remains subjective. The problem of out-of-class modeling always
exists as a practical necessity, but it is rarely, if ever, tackled
directly. Of course, in the happy circumstance that a correct
generative model is in-class, one has identified something
intrinsic about a process. This, however, begs the question of
discovering the class in the first place. And, practically, such
luck is rarely the case. Worse, when they do not work well,
complexity controls give no prescription for choosing an
alternative class.

Intrinsic complexity characterizations have been most
constructively and thoroughly developed for finite-memory,
finite-randomness processes, despite the fact that many important
natural processes are infinitary. The latter include the critical
phenomena [29] of statistical physics and the routes to chaos in
nonlinear dynamics [2], to mention only two. They exhibit
arbitrarily long-range spatiotemporal correlations, infinite
memory, and infinite parameter space dimension. The relationship
between prediction, complexity, and learning is especially
interesting when confronted with infinitary processes, and we
re-investigate that relationship for nonergodic Bandit processes.


III. BANDIT PROCESS CONSTRUCTION 

The simplest construction of a Bandit process is the following.
Consider the stochastic process generated by a biased coin whose
bias $P$ is itself a random variable. First, a coin bias $p$ is
chosen from a user-specified distribution $\Pr(P)$; next, a
bi-infinite sequence $x^1 = \ldots x_{-1} x_0 x_1 x_2 \ldots$ is
generated from a coin with this particular bias; then, this is
repeated for an arbitrarily large number of such trials, generating
an ensemble $\{x^1, x^2, x^3, \ldots\}$ of sequences at different
biases. The process of interest is this sequence ensemble. We
denote the random variable block between times $a$ and $b$, but not
including that at time $b$, as
$X_{a:b} = X_a X_{a+1} \ldots X_{b-1}$. We suppress indices that
are infinite. And so, the process of interest is denoted $X_{:}$.
To denote the random variable block conditioned on a random
variable $Z$ taking realization $z$, we use $X_{a:b} | Z = z$. So
here, the subprocess $X_{:} | P = p$ is that produced by a coin
with bias $p$.
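
The construction can be made concrete with a short simulation. The following is only a minimal sketch: the uniform prior on the bias, the finite window length, and the function name `sample_bandit_trials` are illustrative assumptions, not part of the construction itself.

```python
import random


def sample_bandit_trials(num_trials, length, seed=0):
    """Draw finite windows from several trials of a coin-flip Bandit process.

    Assumption (not from the text): the bias prior Pr(P) is uniform on [0, 1].
    Each trial draws one bias p and then flips that same coin `length` times.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        p = rng.random()  # bias chosen once per trial, then held fixed
        x = [1 if rng.random() < p else 0 for _ in range(length)]
        trials.append((p, x))
    return trials


if __name__ == "__main__":
    for p, x in sample_bandit_trials(num_trials=5, length=2000):
        freq = sum(x) / len(x)
        print(f"bias p = {p:.3f}   empirical head frequency = {freq:.3f}")
```

Within a trial the head frequency settles near that trial's bias, while it differs from trial to trial; this is the nonergodicity that the analysis exploits.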

A single one of these bi-infinite sequences comes from an ergodic
process that is memoryless in every sense of the word. In
particular, since in each trial past and future are independent,
the conditional past-future mutual information
$I[X_{-M:0}; X_{0:N} | P = p]$ vanishes for any $M$, $N$, and $p$.
However, each of these bi-infinite chains is statistically
distinct. The mean number of heads, say, in one is very different
than the mean number of heads in another. For sufficiently long
chains, such differences are almost surely not the consequence of
finite-sample fluctuations.

The overall process $X_{:}$ does not distinguish between sequences
generated by different biased coins. So, by making the coin bias a
random variable, the past and future are no longer independent.
Both share information about the underlying coin bias $p$. As we
will now show, the shared information or excess entropy
$\mathbf{E}(M, N) = I[X_{-M:0}; X_{0:N}]$ diverges with $M$ and $N$
when $P$ is a continuous random variable.


IV. INFORMATION ANALYSIS 

To see why, we abstract to a more general case. What follows is an
alternative, direct derivation of results in Ref. [3, Sec. 4] that,
due to its simplicity, lends additional transparency to the
mechanisms driving the divergence.

Let $\Theta$ be a random variable with realizations $\theta$ in a
(parameter) space of dimension $K$. $\Theta$ has some as-yet
unspecified relationship with observations
$X_{:} = \ldots, X_{-2}, X_{-1}, X_0, X_1, \ldots$. We can always
perform the following information-theoretic decomposition of the
composite process's excess entropy:

$$
I[X_{-M:0}; X_{0:N}] = I[X_{-M:0}; X_{0:N} | \Theta]
  + I[X_{-M:0}; X_{0:N}; \Theta] . \tag{1}
$$

The first term quantifies the range of temporal correlations of the
observed process given $\Theta$, and the second term quantifies the
dependencies between past and future purely due to $\Theta$. When
the fixed-parameter process $X_{:} | \Theta = \theta$ is ergodic
and the composite process $X_{:}$ is not, then Eq. (1) can be
viewed as a decomposition of $I[X_{-M:0}; X_{0:N}]$ into ergodic
and nonergodic contributions, respectively.

The second term $I[X_{-M:0}; X_{0:N}; \Theta]$ is a multivariate
mutual information [30] or co-information [31]. It is closely
related to parameter estimation, as expected [3], since it provides
information about the dimension $K$ of $\Theta$. Standard
information-theoretic identities yield:

$$
I[X_{-M:0}; X_{0:N}; \Theta] = H[\Theta] + H[\Theta | X_{-M:N}]
  - H[\Theta | X_{-M:0}] - H[\Theta | X_{0:N}] . \tag{2}
$$

The first term $H[\Theta]$ quantifies our intrinsic uncertainty in
the parameter (for the coin of Sec. III, its bias). When $\Theta$
is a continuous random variable, $H[\Theta]$ is a differential
entropy. The subsequent terms describe how our uncertainty in
$\Theta$ decreases after seeing blocks of lengths $M + N$, $M$, or
$N$.
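
For readers who want the intermediate steps, one way to arrive at this entropy form of the co-information is sketched here. The shorthand $A = X_{-M:0}$ and $B = X_{0:N}$, so that $(A, B) = X_{-M:N}$, is introduced only for this sketch; the first line is the standard symmetric form of the three-variable co-information.

```latex
\begin{align*}
I[A; B; \Theta]
  &= I[A; \Theta] + I[B; \Theta] - I[A, B; \Theta] \\
  &= \bigl( H[\Theta] - H[\Theta | A] \bigr)
   + \bigl( H[\Theta] - H[\Theta | B] \bigr)
   - \bigl( H[\Theta] - H[\Theta | A, B] \bigr) \\
  &= H[\Theta] + H[\Theta | A, B] - H[\Theta | A] - H[\Theta | B] .
\end{align*}
```

Each mutual information with $\Theta$ is written as a reduction in uncertainty about $\Theta$, which is why only the prior entropy and the three posterior entropies survive.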

Altogether, Eqs. (1) and (2) give:

$$
\begin{aligned}
I[X_{-M:0}; X_{0:N}] = {} & I[X_{-M:0}; X_{0:N} | \Theta] + H[\Theta] \\
  & + H[\Theta | X_{-M:N}] - H[\Theta | X_{-M:0}] \\
  & - H[\Theta | X_{0:N}] .
\end{aligned} \tag{3}
$$

Thus, assuming one chose a prior with finite entropy $H[\Theta]$,
divergences in $I[X_{-M:0}; X_{0:N}]$ can come from divergences in
$I[X_{-M:0}; X_{0:N} | \Theta]$ or from divergences in
$H[\Theta | X_{-M:N}] - H[\Theta | X_{-M:0}] - H[\Theta | X_{0:N}]$.

Let's take the cases covered in Ref. [3, Secs. 4.1-4.4]. There,
$\Theta$ consists of the model parameters, $\theta$ are
realizations of $\Theta$, and $X_{:} | \Theta = \theta$ consists of
(noisy, potentially temporally correlated) sequences generated by
the model with parameters $\theta$. For instance, $\Theta$ could be
the firing rate of a Poisson neuron and $X_{:} | \Theta = \theta$
could be the time-binned spike trains at firing rate $\theta$. Or,
$\Theta$ could be transition probabilities in a finite Hidden
Markov Model (HMM) and $X_{:} | \Theta = \theta$ could be the
generated process given transition probabilities $\theta$. The
result, in any case, is a nonergodic process $X_{:}$ constructed
from a mixture of ergodic component processes
$X_{:} | \Theta = \theta$.

In these examples, the component-process excess entropy
$I[X_{-M:0}; X_{0:N} | \Theta] =
\langle I[X_{-M:0}; X_{0:N} | \Theta = \theta] \rangle_\theta$
does not diverge with $M$ or $N$, since finite HMMs have finite
excess entropy, which is bounded by the internal state entropy
HIM]-. In fact, the excess entropy for many ergodic stochastic
processes is finite, even if generated by infinite-state HMMs. Any
divergence in the composite process $I[X_{-M:0}; X_{0:N}]$
therefore comes from divergences in
$H[\Theta | X_{-M:N}] - H[\Theta | X_{-M:0}] - H[\Theta | X_{0:N}]$.

Since the composite process includes sequences $x^i$ from trials
with different $\theta$, one's intuition might suggest that
$\Pr(\Theta = \theta | X_{-M:0} = x_{-M:0})$ is multimodal for most
$x_{-M:0}$. However, existing results [55H5B] on the asymptotic
normality of posteriors carry over to this setting, since they
essentially rely on the log-likelihood function
$\log \Pr(X_{-M:0} = x_{-M:0} | \Theta = \theta)$ being
sufficiently well behaved.

For instance, consider the Bandit process construction of Sec. III.
A crude derivation of the asymptotic normality of
$\Pr(\Theta = \theta | X_{-M:0} = x_{-M:0})$ [32] starts with
Bayes' Rule:

$$
\Pr(\Theta = \theta | X_{-M:0} = x_{-M:0})
  = \frac{\Pr(X_{-M:0} = x_{-M:0} | \Theta = \theta) \Pr(\Theta = \theta)}
         {\Pr(X_{-M:0} = x_{-M:0})} .
$$

The denominator $\Pr(X_{-M:0} = x_{-M:0})$ is quite complicated to
calculate, but this normalization factor does not affect the
$\theta$-dependence of $\Pr(\Theta = \theta | X_{-M:0} = x_{-M:0})$.
More to the point, the prior's contribution $\Pr(\Theta = \theta)$
is dwarfed by the likelihood:

$$
\Pr(X_{-M:0} = x_{-M:0} | \Theta = \theta)
  = \theta^{\sum_{i=0}^{M-1} x_i}
    (1 - \theta)^{M - \sum_{i=0}^{M-1} x_i} ,
$$

in the large-$M$ limit, where the $M$ observed symbols are labeled
$x_0, \ldots, x_{M-1}$. Let $\theta^*$ be the unique maximum of
$\Pr(\Theta = \theta | X_{-M:0} = x_{-M:0})$:
$\theta^* = \frac{1}{M} \sum_{i=0}^{M-1} x_i + O(1/M)$.
Taylor-expanding $\log \Pr(\Theta = \theta | X_{-M:0} = x_{-M:0})$
about $\theta^*$ suggests that
$\Pr(\Theta = \theta | X_{-M:0} = x_{-M:0})$ is approximately
normal in the large-$M$ limit, with variance decaying as
$\sim 1/M$. (Any one of the many sources [33–36] on asymptotic
normality of posteriors provides rigorous and generalized
statements.)
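
The location and width of the posterior can be checked directly in the coin example. The sketch below assumes a uniform prior on the bias, so that the posterior after $k$ heads in $M$ flips is a Beta distribution with parameters $(k+1, M-k+1)$; the prior and the function name `beta_posterior_summary` are illustrative choices.

```python
def beta_posterior_summary(num_heads, M):
    """Mode and variance of the posterior over a coin's bias.

    Assumption (not from the text): uniform prior on the bias, so the
    posterior after `num_heads` heads in M flips is Beta(a, b) with
    a = num_heads + 1 and b = M - num_heads + 1.
    """
    a = num_heads + 1
    b = M - num_heads + 1
    mode = num_heads / M  # maximizer of theta^k (1 - theta)^(M - k)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mode, variance


if __name__ == "__main__":
    for M in (100, 1000, 10000):
        k = (3 * M) // 10  # a data set with 30% heads
        mode, var = beta_posterior_summary(k, M)
        # M * variance should approach mode * (1 - mode) as M grows.
        print(M, mode, M * var, mode * (1 - mode))
```

The product of $M$ and the posterior variance approaches $\theta^* (1 - \theta^*)$, consistent with a variance that decays as $1/M$.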

Armed with such asymptotic normality, we now turn to finding the
asymptotic forms of $H[\Theta | X_{-M:0} = x_{-M:0}]$,
$H[\Theta | X_{0:N} = x_{0:N}]$, and
$H[\Theta | X_{-M:N} = x_{-M:N}]$ in the large-$M$ and large-$N$
limits. The differential entropy of a normal distribution is, up to
an additive constant, $\frac{1}{2} \log |\det \Sigma|$, where
$\Sigma$ is the covariance matrix; here, $\det \Sigma \sim M^{-K}$,
since the error variance for each of the $K$ parameters decays as
$1/M$. So, this and asymptotic normality of the posterior imply
that:

$$
H[\Theta | X_{-M:0} = x_{-M:0}] \sim -\frac{K}{2} \log M ,
$$

plus corrections of $O(1)$ in $M$, and thus:

$$
H[\Theta | X_{-M:0}] \sim -\frac{K}{2} \log M ,
$$

where $K$ is the parameter space dimension.

At first blush, the result is counterintuitive. In the limit that
$M$ and $N$ tend to infinity, and we see longer and longer
sequences $x_{-M:0}$, we become more certain as to $\Theta$'s
value. This increasing certainty should mean that the conditional
entropy $H[\Theta | X_{-M:0} = x_{-M:0}]$ vanishes. However, if
$\Theta$ is a continuous random variable (such as a Poisson rate),
then $H[\Theta | X_{-M:0} = x_{-M:0}]$ is a differential entropy.
As our variance in $\Theta | X_{-M:0} = x_{-M:0}$ decreases to 0,
the differential entropy $H[\Theta | X_{-M:0} = x_{-M:0}]$ diverges
to negative infinity. It is exactly this well known divergence that
causes a divergence in $I[X_{-M:0}; X_{0:N}]$ for the nonergodic
processes we are considering.
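
The negative divergence can be seen without any normal approximation in the coin example. The sketch below assumes a uniform prior on the bias, so the posterior is a Beta distribution with integer parameters, and it uses the standard closed form for the Beta differential entropy with the digamma function evaluated at integers through harmonic numbers. The function names are illustrative.

```python
import math


def harmonic(n):
    """n-th harmonic number H_n, with H_0 = 0."""
    return sum(1.0 / k for k in range(1, n + 1))


def beta_entropy_nats(a, b):
    """Differential entropy (nats) of Beta(a, b) for positive integers a, b.

    Uses ln B(a,b) - (a-1) psi(a) - (b-1) psi(b) + (a+b-2) psi(a+b) with
    psi(n) = H_{n-1} - gamma; the Euler-gamma terms cancel in this sum.
    """
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (log_beta
            - (a - 1) * harmonic(a - 1)
            - (b - 1) * harmonic(b - 1)
            + (a + b - 2) * harmonic(a + b - 1))


if __name__ == "__main__":
    # Assumption: uniform prior and a data set with half heads (k = M / 2).
    for M in (10, 100, 1000, 10000):
        k = M // 2
        h = beta_entropy_nats(k + 1, M - k + 1)
        # h falls without bound; h + 0.5 * ln(M) settles to a constant.
        print(M, h, h + 0.5 * math.log(M))
```

The posterior entropy decreases by about $\frac{1}{2} \log 4$ each time $M$ is quadrupled, the $K = 1$ case of the logarithmic scaling.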


From these results and Eq. (3), one has:

$$
I[X_{-M:0}; X_{0:N}; \Theta] \sim \frac{K}{2} \log \frac{MN}{M+N} .
$$

And, recalling that the ergodic-component information does not
diverge, we immediately recover:

$$
I[X_{-M:0}; X_{0:N}] \sim \frac{K}{2} \log \frac{MN}{M+N} . \tag{4}
$$

Lower-order terms in $M$ and $N$ include the expected
log-determinant of the Fisher information matrix for maximum
likelihood estimates of $\Theta$ [38].
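
The scaling in Eq. (4) can be checked exactly for the coin Bandit process of Sec. III. The sketch assumes a uniform prior on the bias, in which case a length-$L$ word with $k$ heads has probability $k!(L-k)!/(L+1)!$ and the block entropy has a closed form; this prior and the function names are illustrative assumptions.

```python
import math


def block_entropy_bits(L):
    """Shannon entropy (bits) of length-L words of the coin Bandit process.

    Assumption (not from the text): uniform prior on the bias. A word with
    k heads then has probability 1 / ((L + 1) * C(L, k)), which gives
    H(L) = log2(L + 1) + average over k = 0..L of log2 C(L, k).
    """
    log_binoms = [
        math.lgamma(L + 1) - math.lgamma(k + 1) - math.lgamma(L - k + 1)
        for k in range(L + 1)
    ]
    return (math.log(L + 1) + sum(log_binoms) / (L + 1)) / math.log(2)


def excess_entropy_bits(M, N):
    """E(M, N) = H(M) + H(N) - H(M + N) for a stationary process."""
    return (block_entropy_bits(M) + block_entropy_bits(N)
            - block_entropy_bits(M + N))


if __name__ == "__main__":
    for j in range(4, 13):
        N = 2 ** j
        e = excess_entropy_bits(N, N)
        # With K = 1, Eq. (4) predicts growth like 0.5 * log2(N / 2).
        print(N, round(e, 4), round(e - 0.5 * math.log2(N / 2), 4))
```

The last column settles to a constant, so each doubling of $M = N$ adds about half a bit, as expected for a single parameter.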

A similar information-theoretic decomposition can be used to
upper-bound the excess entropy of ergodic processes as well. For
instance, App. A uses a similar decomposition to show that the
temporal excess entropy of an Ising spin on a two-dimensional Ising
lattice at criticality is finite.

Logarithmic divergences in excess entropy also occur in stationary
ergodic processes, such as exhibited at the onset of chaos through
period-doubling [2]. And, alternative scalings are known, such as
power-law divergences [3, Sec. 4.5]. For natural language texts
there is empirical evidence that the excess entropy diverges. One
form is referred to as Hilberg's Law [U EH HO]:
$I[X_{-N:0}; X_{0:N}] \propto \sqrt{N}$.

In contrast with Sec. IV's rather direct calculation, it is far
less straightforward to analyze these power-law divergences:

$$
I[X_{:0}; X_{0:N}] \sim N^{\gamma} , \tag{5}
$$

with $\gamma \in [0, 1)$. While there are results on asymptotics of
posteriors for nonparametric Bayesian inference, many aim to
establish asymptotic normality of the posterior; e.g., as in Refs.
mm-. As far as we know, no result yet recovers the aforementioned
power-law divergence; likely, since existing asymptotic analyses
avoid the essential singularity for the prior utilized in Ref.
[3, Sec. 4.5] to obtain power-law divergence.

V. DISCUSSION 

We investigated one large, but particular class of infinitary
processes in terms of how information measures diverge; recovering,
in short order, a previously reported logarithmic divergence in
Bandit-like process past-future mutual information. Practically,
this suggests that one could use the scaling of empirical estimates
of past-future information as a function of sequence length to
estimate a process's parameter space dimension. Mathematically and
somewhat surprisingly, the derivation shows that the reason
Bandit-like processes exhibit information divergences derives from
the role nominally finite-sample effects (asymptotic normality)
play in a framework that otherwise assumes arbitrarily large
amounts of data.
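
As a rough illustration of that practical suggestion, the parameter space dimension can be read off as twice the slope of the excess entropy against the logarithm of the block length. The sketch below is idealized: it uses exact excess entropies for the coin Bandit process with a uniform bias prior rather than empirical estimates, and the function names and chosen block lengths are illustrative assumptions.

```python
import math


def block_entropy_bits(L):
    """Exact block entropy (bits) for the coin Bandit process.

    Assumption (not from the text): uniform prior on the coin bias, so
    H(L) = log2(L + 1) + average over k = 0..L of log2 C(L, k).
    """
    log_binoms = [
        math.lgamma(L + 1) - math.lgamma(k + 1) - math.lgamma(L - k + 1)
        for k in range(L + 1)
    ]
    return (math.log(L + 1) + sum(log_binoms) / (L + 1)) / math.log(2)


def estimate_dimension(lengths):
    """Estimate K as twice the least-squares slope of E(N, N) vs log2 N."""
    xs = [math.log2(N) for N in lengths]
    ys = [2 * block_entropy_bits(N) - block_entropy_bits(2 * N)
          for N in lengths]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return 2 * slope


if __name__ == "__main__":
    lengths = [2 ** j for j in range(8, 14)]
    print("estimated parameter dimension K:", estimate_dimension(lengths))
```

With empirical data the block entropies would have to be estimated from samples, which introduces bias at large block lengths that this idealized sketch does not address.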



FIG. 1. Prediction hierarchy for stationary ergodic processes: Each
level describes a process class with finite informational
quantities. A class above finitely models the processes in the
class below. Classes are separated by divergence in the
corresponding informational quantity. Moving up the hierarchy
corresponds to it diverging. Example processes are finitely
presented at each level, but infinitely presented at the preceding
lower level. Sofic: typical unifilar HMMs, e.g., the Even Process
[1]; Generative: typical nonunifilar HMMs [52]; Finitary: typical
infinite nonunifilar HMMs; Infinitary: highly atypical infinite
HMMs with long-range memory, e.g., the ergodic construction in
Ref. [1].

Section IV's scaling analysis left open the possibility that
information divergences can be driven by the ergodic components
themselves. So, what is known about information divergences in
ergodic processes? An information divergence hints at a structural
level in the space of ergodic processes; a space that is itself
highly organized. This is seen in the hierarchy of divergences
separating processes into classes of distinct architecture,
depicted in Fig. 1. (See also Table 1, Fig. 18, and Sec. 5 in Ref.
@5].) Processes at each level are distinguished by different
scalings for their complexity and in how difficult they are to
learn and predict.

At the lowest level (Markov) are processes described by finite
$\epsilon$-machines with finite history dependence (finite Markov
order $R$); e.g., those described by existing Maximum Caliber
models [HJ or by measure subshifts of finite type [45]. Though very
commonly posited as models, they inhabit a vanishingly small
measure in the space of processes [IB]. At the next level (Sofic)
of structure are processes described by $\epsilon$-machines with
finite $C_\mu$. These typically have infinite Markov order; e.g.,
the measure-sofic processes. Above this level are processes
generated by general (that is, nonunifilar) HMMs with uncountable
recurrent causal states and divergent statistical complexity that,
nonetheless, have finite generative complexity,
$C_{\mathrm{gen}} < \infty$ [32]. Processes at the generative level
not only have infinite Markov order and storage, but also require a
growing amount of memory for accurate prediction. One consequence
is that they are inherently unpredictable by any observer with
finite resources. Note, however, that predictability is complicated
at all levels by cryptic processes m, those with arbitrarily small
excess entropy, but large statistical complexity. When the smallest
generative model is infinite but the process still has short-term
memory, we arrive at the class of finitary processes
($\mathbf{E} < \infty$).

Processes with divergent excess entropy, the infinitary processes,
inhabit the upper reaches of this hierarchy. Predicting such
processes necessarily requires infinite resources, but accurate
prediction can also return infinite dividends. We agree, here, with
Ref. [3]: the asymptotic rate of information divergence is a useful
proxy for process complexity. Historically, this view appears to
have been anticipated in Shannon's introduction of the dimension
rate [48, App. 7] of an ensemble of functions:

$$
\lambda = \lim_{\delta \to 0} \lim_{\epsilon \to 0} \lim_{T \to \infty}
  \frac{\log N(\epsilon, \delta, T)}{T \log \epsilon} ,
$$

where $N(\epsilon, \delta, T)$ is the smallest number of elements
that can be chosen such that all elements of the ensemble, apart
from a set of measure $\delta$, are within the distance $\epsilon$
of at least one of those chosen.

However, it is as important to know which process mechanism drives
the divergence as it is to know the divergence rate. Infinitary
Bandit processes store memory entirely in their nonergodic
component. Our analysis identified the divergence in this memory
with the well known divergence in the differential entropy of
highly peaked distributions of vanishing width. Generalizing Bandit
processes to have structured ergodic components, we now see that
even finite $\epsilon$-machines trivially generate infinitary
processes when their transition probabilities are continuous random
variables.

Thus, in this case, we also agree that information divergence is a
"necessary but not sufficient" criterion for process complexity
[5]. (Appendix A, however, looks at critical phenomena in spin
systems to call out a caveat.) This leaves open a broad challenge
to understand the sufficient mechanisms for information
divergences. For example, we have yet to develop similar
informational and computation-theoretic analyses for the infinitary
ergodic processes in Refs. SI 13-.

Looking forward, the simplicity of our structural complexity
analysis opens up the possibility to better frame information in
hierarchical processes [331 Sec. 5], such as the structural
hierarchy in biology [39, Fig. 6], epochal evolution [50], and
knowledge hierarchies in social systems such as semantics in human
language [51]. These are processes in which multiple levels of
mechanism are manifest and operate simultaneously and in which each
level is separated from those below via phase transitions that lead
to various signatures of informational and structural divergence.

Appendix A: Truly Complex Spin Systems?

Reference [5] pointed out that many infinitary processes do not
satisfy intuitive definitions for complexity. It suggested that
divergence in $\mathbf{E}$ is a "necessary but not sufficient
condition" for a process being truly complex. While intuitively
compelling, perhaps divergent $\mathbf{E}$ is not even a necessary
condition. Let's explain.

Spin systems at criticality are one of the most familiar examples
of truly complex processes: global correlations emerge from purely
local interactions [29]. Evidence of this complexity appears even
if we are only allowed to observe a single spin's interaction with
another on the lattice. At the critical temperature, the
interaction has a power-law autocorrelation function; at all other
temperatures, the spin's autocorrelation function is asymptotically
exponential. The spatial excess entropy of these configurations
appears to diverge at criticality [52], too. However, does the
temporal excess entropy $\mathbf{E}(M, N)$ (roughly, the
interaction of a single spin with itself at later times) also
diverge at criticality?

Surprisingly, the excess entropy of the dynamics of a single spin
on an Ising lattice is finite, even at the critical temperature,
unless there are nonlocal spatial interactions between lattice
spins. Consider evolving the lattice configurations via Glauber
dynamics for concreteness [25]. That is, spin $j$'s next state
$\sigma^j_{t+1}$ is determined stochastically by its previous state
$\sigma^j_t$ and its effective magnetic field
$h^j_t = \sum_i J_{ij} \sigma^i_t$. In other words, $h^j_t$ and
$\sigma^j_t$ causally shield the past $\sigma^j_{:t}$ from the
future $\sigma^j_{t+1:}$, implying that:

$$
I[\sigma^j_{t-M:t}; \sigma^j_{t+1:t+N} | h^j_t]
  = I[\sigma^j_t; \sigma^j_{t+1} | h^j_t]
  \le H[\sigma^j_t] .
$$

Given a finite set of spin values and local interactions, $h^j_t$
can only take a finite number of values. Thus,
$H[h^j_t] < \infty$, and so:

$$
I[\sigma^j_{t-M:t}; \sigma^j_{t+1:t+N}; h^j_t] \le H[h^j_t] < \infty ,
$$

as well.

A more familiar example makes this concrete. For the standard
two-dimensional Ising lattice, $J_{ij} = J$ if $i$ and $j$ are
nearest neighbors, and $J_{ij} = 0$ otherwise. There, $h^j_t$ can
only take 5 possible values, $h^j_t \in \{0, J, 2J, 3J, 4J\}$,
giving:

$$
I[\sigma^j_{t-M:t}; \sigma^j_{t+1:t+N}; h^j_t] \le H[h^j_t]
  \le \log_2 5 \text{ bits} .
$$

The information-theoretic decomposition in Eq. (1) applies in this
particular situation. Here, observed variables $X_t$ are spins
$\sigma^j_t$, and the parameters $\Theta$ are replaced by $h^j_t$.
The bounds above then directly imply that
$\mathbf{E}(M, N) < \infty$ for all $M$ and $N$. In fact, for the
standard two-dimensional Ising lattice, we find that
$\mathbf{E}(-\infty, \infty) \le 1 + \log_2 5 \approx 3.32$ bits.
We expect excess entropy to diverge only when $h^j_t$ is a
continuous random variable. This can happen when $J_{ij}$ is
nonzero for an infinite number of $i$'s. However, this necessitates
global, not local, spin-spin couplings.
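
The counting behind this bound is simple enough to enumerate. In the sketch below, the spin values $\{0, 1\}$, the four nearest neighbors, and the function name `field_values` are illustrative assumptions chosen to match the five field values quoted above; spins in $\{-1, +1\}$ give a different set of values but the same count.

```python
import itertools
import math


def field_values(spin_values=(0, 1), num_neighbors=4, J=1):
    """Distinct effective fields h = J * (sum of neighboring spins).

    Assumptions (illustrative): nearest-neighbor coupling J on a square
    lattice, so each spin has 4 neighbors, and spins take values in
    `spin_values`.
    """
    configs = itertools.product(spin_values, repeat=num_neighbors)
    return sorted({J * sum(config) for config in configs})


if __name__ == "__main__":
    values = field_values()
    field_bound = math.log2(len(values))  # H[h] <= log2(number of values)
    spin_bound = 1.0                      # H[sigma] <= 1 bit for a binary spin
    print("field values:", values)
    print("bound on H[h] in bits:", field_bound)
    print("bound on temporal excess entropy in bits:", spin_bound + field_bound)
```

The total is the sum of the one-bit bound on the spin entropy and the bound on the field entropy, the two terms in the decomposition.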

On the one hand, this analysis does not negate $\mathbf{E}$'s
utility as a generalized order parameter [53]. It is still likely
maximized at the critical point, even if its temporal version does
not diverge. On the other, our analysis shows that phenomena (here,
spin lattices with purely local couplings) do not necessarily have
divergent $\mathbf{E}$ even when many would consider their dynamics
to be truly complex when the system is critical.

At first glance, the analysis contradicts the experiments in Fig. 1
of Ref. [3] for the Ising lattice with only local interactions. A
more careful look reveals that there is no contradiction at all.
There, coupling strengths were randomly changed every 400,000
iterations, so the resultant time series looked like a
concatenation of samples from a Bandit process. The analysis in
Sec. IV then predicts the observed logarithmic scaling in Fig. 1
there for $N < 25$. However, it also implies that
$\mathbf{E}(-\infty, N)$ will stop increasing logarithmically at or
before $N = 400,000$.